RESEARCH STATEMENT – Roi Reichart
Abstract
Natural language processing (NLP) is a field that combines linguistics, cognitive science, statistical machine learning and other areas of computer science in order to build intelligent computer systems that can understand human languages. NLP has various applications, among them machine translation, question answering and search engines. Over the past two decades, NLP has come to both rely on and challenge the field of machine learning. Statistical methods now dominate NLP and have moved the field forward substantially, opening up new possibilities for exploiting data in the development of NLP components and applications.
Many state-of-the-art natural language algorithms are based on supervised learning techniques. In this type of learning, a corpus of texts annotated by human experts is compiled and used to train a learning algorithm. While supervised learning has made substantial contributions to NLP, it faces some significant challenges. Many fundamental NLP tasks, such as syntactic parsing, part-of-speech (POS) tagging and machine translation, involve structured prediction and sequence labeling. For tasks of this nature there is an annotation bottleneck: compiling annotated corpora is costly and error-prone because the annotation itself is complex. A closely related challenge is domain adaptation. Supervised algorithms usually perform well when the training data and the data for which they should provide predictions (the test data) are drawn from similar domains. When an algorithm trained on data from one domain has to provide predictions for data from a substantially different domain, its performance degrades markedly. Again, the burden of manual annotation makes it infeasible to create sufficient examples for a wide range of domains.
Supervised natural language learning is also challenged by methodological problems. Annotation schemes for tasks such as syntactic parsing are often based on arbitrary decisions. Such schemes often describe certain structures in detail while addressing others only briefly, and many applications would benefit from different annotation decisions.
My research focuses on developing machine learning techniques that deal with these challenges. Its main theme is using the large amounts of raw text now available to create state-of-the-art algorithms that require little or no manually annotated data. Cases where small amounts of manually annotated text are used are referred to as semi-supervised learning [1], while cases where only raw text is used are known as unsupervised learning [2, 3]. I have explored semi-supervised and unsupervised techniques for NLP tasks that involve structured prediction and sequence labeling, mainly syntactic parsing and POS tagging.
Similar resources
Reconstructing Native Language Typology from Foreign Language Usage
Linguists and psychologists have long been studying cross-linguistic transfer, the influence of native language properties on linguistic performance in a foreign language. In this work we provide empirical evidence for this process in the form of a strong correlation between language similarities derived from structural features in English as Second Language (ESL) texts and equivalent similarit...
Improved Unsupervised POS Induction through Prototype Discovery
We present a novel fully unsupervised algorithm for POS induction from plain text, motivated by the cognitive notion of prototypes. The algorithm first identifies landmark clusters of words, serving as the cores of the induced POS categories. The rest of the words are subsequently mapped to these clusters. We utilize morphological and distributional representations computed in a fully unsupervi...
Survey on the Use of Typological Information in Natural Language Processing
In recent years linguistic typology, which classifies the world’s languages according to their functional and structural properties, has been widely used to support multilingual NLP. While the growing importance of typological information in supporting multilingual tasks has been recognised, no systematic survey of existing typological resources and their use in NLP has been published. This pap...
Effective Greedy Inference for Graph-based Non-Projective Dependency Parsing
Exact inference in high-order graph-based non-projective dependency parsing is intractable. Hence, sophisticated approximation techniques based on algorithms such as belief propagation and dual decomposition have been employed. In contrast, we propose a simple greedy search approximation for this problem which is very intuitive and easy to implement. We implement the algorithm within the second...
Improved Lexical Acquisition through DPP-based Verb Clustering
Subcategorization frames (SCFs), selectional preferences (SPs) and verb classes capture related aspects of the predicate-argument structure. We present the first unified framework for unsupervised learning of these three types of information. We show how to utilize Determinantal Point Processes (DPPs), elegant probabilistic models that are defined over the possible subsets of a given dataset and...